ABSTRACT


Three hundred fifty-two standardized tests of reading
comprehension and 373 standardized vocabulary measures were analyzed
in terms of a number of criteria related to psychometric quality and
educational quality. The criteria were based primarily on the
Standards for Educational and Psychological Tests developed by the
American Psychological Association, the American Educational Research
Association, and the National Council on Measurement in Education,
and the MEAN test evaluation system developed by the Center for the
Study of Evaluation. Fewer than 10 percent of the tests reported
reliability and validity coefficients sufficiently high to make them
appropriate for use by researchers or evaluators. Approximately 30 to
40 percent of the tests reported good raw score distribution
characteristics and useful converted scores. Very few instruments had
nationally representative norm samples or useful information for
decision making about pupils. Given the quality of many
"off-the-shelf" instruments, researchers and evaluators should have a
variety of alternate measurement strategies at hand. (Author/MV)
The Quality of High School Reading and
Vocabulary Tests: Implications for the Researcher


Joseph M. Petrosko
University of Louisville


Standardized tests in the areas of Reading Comprehension
(N=352) and Vocabulary (N=373) were analyzed on a number of criteria
related to educational and psychometric quality. For many criteria
related to validity and reliability, fewer than 10% of the tests
reported correlations high enough to make them strong
candidates for use by a researcher or evaluator. A moderately large
percentage (in the area of 30 to 40 percent) of tests had good raw
score distribution characteristics and useful converted scores. Very
few tests had nationally representative norm samples or useful
information for decision-making about pupils. The researcher needs
to have a variety of measurement strategies in mind to compensate
for the weaknesses of many "off-the-shelf" instruments.


In 1974, a large scale project was completed that involved a quality assessment
of all published standardized tests aimed at secondary level students (Hoepfner,
Conniff, Petrosko, Watkins, Erlich, Todaro, Hoyt, McGuire, Klibanoff, Stangel,
Lee, Rest, Hufano, Bastone, Ogilvie, Hunter, & Johnson, 1974). Approximately 5,400
tests (or subtests of larger test batteries) were subjected to a detailed evaluation
procedure. Tests were rated on many criteria of psychometric and educational
quality. For the most part, the criteria well represented the concerns expressed in
the Standards for Educational and Psychological Tests (Joint Committee of the
American Psychological Association, American Educational Research Association and
National Council on Measurement in Education, 197 

This large body of data allowed many types of comparisons to be made regarding
current tests. Petrosko and Hufano (1975) examined the quality of high school
Mathematics tests. S[illegible]ni and Petrosko (1976) used the entire set of ratings to
develop a theory to guide the evaluation of standardized tests. The present study
focuses on the quality of high school tests in Reading Comprehension and Vocabulary.


The objectives of the study were: 1) to report on the general level of quality
of Reading and Vocabulary tests, and 2) to explore the implications of these
findings for test users, especially researchers.


Method

How tests were evaluated


A detailed description of evaluation procedures is contained in Hoepfner et al.
(1974). The following describes, in brief, the process that was followed.


Following a canvass of test catalogs and test publishers, all tests suitable or
recommended for secondary students (except clinical and projective measures) were
ordered. For each test, evaluators decided if the instrument would be evaluated
in whole or in parts. A subtest was evaluated if it yielded a separate score which
the publisher or the organization of the test itself clearly indicated could be
interpreted separately. Using this rule, a test was evaluated: 1) as a whole and for
each of the subtests, or 2) only as a whole, or 3) only for the subtests.

Each test and subtest was categorized by grade level according to the claims 
or directions of the publisher. In the absence of such information, test evaluators 
estimated grade levels according to common curriculum sequences and item difficulties. 
Tests were assigned to one or more of three separate categories: 7-8, 9-10, or 11-12.
Those tests that spanned categories (e.g. some tests were labeled "high school"
and intended for grades 9 through 12) were evaluated for each grade combination and
reported separately at each level.


Two raters independently assigned each test or subtest to one of 298 categories -
234 goals subsumed under 6 more general goals. Developed after consulting textbooks,
curriculum guides, journal articles, and other publications, the goals constituted
a comprehensive taxonomy of secondary education in terms of student outcomes. The
wide ranging collection included traditional subject-matter areas (e.g. goals in
English, Mathematics, and Science), Vocational and Career Education, Personality
Characteristics (i.e. goals in the affective domain), and Physical Education.


After decisions were made about evaluation of subtests, about assignment to
grade level, and about categorization into goal area, the tests were evaluated on
39 criteria of test quality. The 39 criteria were grouped into four broad areas:
Measurement Validity, Examinee Appropriateness, Administrative Usability, and Normed
Technical Excellence (yielding the acronym MEAN evaluation system). These criteria
were applied only to the materials provided by the test publisher or distributor.


For each test or subscale that was evaluated, the reviewer used a standard rating
form. Every test was independently rated according to the MEAN system by at least
two raters, each working without access to the other's ratings. The final
adjudication of test assignment to goal area and adjudication of the 39 quality ratings
were both performed by an additional rater. All raters had the same information on
each test--a standard specimen set consisting of the test itself and, in some
cases, a manual or other types of supporting information.


It is important to point out that rigor was applied in considering supporting
evidence on all tests. Thirteen of the 39 MEAN criteria deal with empirical
aspects of tests, mostly related to validity and reliability. For these criteria,
two rules were devised. The student samples used in generating empirical data must:
(1) contain some students in at least one of the two grades for a given evaluation
(e.g. 9-10, 11-12) and (2) include students at, but not more than one grade
level above or below, these grades.

Using these rules, a test being evaluated for Grades 9-10 would receive credit
for validity or reliability criteria if student samples contained any grade
combination that included grade 9 and grade 10, but did not include any students
at grade 7 or below or grade 12 and above.


The practical effect of these rules was to downgrade those tests where care
was not taken in reporting data or in planning validity and reliability studies.
A number of tests had "high school" forms in which a mix of students from all
grade levels of high school were used in test development. Such data were not
credited. For example, the data for the grades 9-10 evaluation did not receive
credit because grade 12 is more than one grade above grade 10. Similarly, the
data for grades 11-12 were not credited since grade 9 is more than one grade below
11.
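The two sample rules described above can be sketched in a few lines of Python. This is only an illustration, not the procedure the raters actually used: the function name `sample_is_credited` and the representation of a sample as a list of integer grade levels are assumptions made for the example, and rule (1) is read as requiring at least one of the two evaluated grades.

```python
def sample_is_credited(sample_grades, eval_grades):
    """Return True if an empirical sample would earn credit (hypothetical helper).

    sample_grades: iterable of integer grade levels present in the study sample.
    eval_grades:   (low, high) grade pair being evaluated, e.g. (9, 10).
    """
    low, high = eval_grades
    grades = set(sample_grades)
    # Rule 1: the sample contains at least one of the two evaluated grades.
    has_target_grade = low in grades or high in grades
    # Rule 2: no grade in the sample lies more than one level
    # below or above the evaluated pair.
    within_one_grade = all(low - 1 <= g <= high + 1 for g in grades)
    return has_target_grade and within_one_grade


# A mixed "high school" sample (grades 9-12) fails at both levels,
# which matches the two examples given in the text.
print(sample_is_credited([9, 10, 11, 12], (9, 10)))    # grade 12 is too high
print(sample_is_credited([9, 10, 11, 12], (11, 12)))   # grade 9 is too low
print(sample_is_credited([8, 9, 10, 11], (9, 10)))     # stays within one grade
```

Under this reading, a publisher who pooled all high school grades in one reliability study would receive no credit at either the 9-10 or the 11-12 level.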


Test Evaluation Personnel 


All test evaluations were performed by individuals trained in educational
testing. The majority of test evaluators held an MA or a Ph.D. in education
or psychology.


Goal Area Selected for Study


The two goal areas selected for study are as follows.

Goal 6 A
Reading Comprehension Skills
Identifies the main idea and important details; determines the meaning of words
from the way they are used; applies the reading technique appropriate to the
subject matter. Draws inferences from material read.

Goal 25 A
Comprehension and Production of Information (Vocabulary)
Has a broad vocabulary. Produces needed information and abstract ideas.
Describes pictures or sounds and illustrates ideas with other ideas.


As might be discerned from a careful reading of goal 25A, this is a broad area,
covering some measures labeled "intelligence" tests. However, the majority of
tests falling in the goal area were traditional vocabulary tests, for example the
vocabulary subtests of achievement test batteries. The tests eliminated from the
goal area for this analysis were the Gilliland Learning Potential Examination
(Picture Completion Subtest), the scales of the Goodenough-Harris Drawing Test,
the Hiskey-Nebraska Test of Learning Aptitude (Completion of Drawings Subtest)
and the Mathematical and Technical Test (Completing Pictures Subtest). What
remained were a large number of measures that all had as their objective a tapping
of student skills in determining word meaning.

Tests for the high school level (grades 9-12) were analyzed in this study.
For Reading Comprehension 352 test and subtest evaluations were analyzed; for
Vocabulary, the total number was 373.


Results 
Several criteria in the MEAN system dealt with the quality of a test's item
selection procedures. Table 1 shows how tests fared on two separate criteria in this
area. For 51% of the Reading Comprehension tests and 65% of the Vocabulary tests,
no information was given by the publisher on item selection procedures. It was
impossible to determine from where items were derived--textbooks, curriculum plans
or some other source were not cited. For the criterion related to empirical item
selection procedures, fewer than 10% of tests provided evidence that procedures like
item analysis or criterion groups analysis were used.


Table 1


Numbers and Percentages of Tests Rated for
Quality of Item Selection


                                   Reading Comprehension       Vocabulary
                                        N        %             N        %

Item Selection Sources
  Detailed Description
  of Item Selection                    63      18%            24       6%
  Statement Made
  on Item Selection                   110      31%           107      29%
  No Information
  on Item Selection                   179      51%           242      65%

Empirical Procedures
for Item Selection
  Evidence of
  Empirical Procedures                 20       6%            35       9%
  No Evidence of
  Empirical Procedures                332      94%           338      91%
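The percentages in Table 1 follow directly from the counts and the two column totals (352 and 373). A short Python sketch of that arithmetic is given here as a convenience for readers who want to check the table; the variable and function names are illustrative only and are not part of the original study.

```python
# Column totals reported in the text.
TOTALS = {"Reading Comprehension": 352, "Vocabulary": 373}

# Counts for the first criterion of Table 1 (Item Selection Sources).
ITEM_SELECTION_SOURCES = {
    "Detailed description": {"Reading Comprehension": 63, "Vocabulary": 24},
    "Statement made": {"Reading Comprehension": 110, "Vocabulary": 107},
    "No information": {"Reading Comprehension": 179, "Vocabulary": 242},
}


def percent(count, total):
    """Whole-number percentage of `count` out of `total`."""
    return round(100 * count / total)


def column_percentages(table, totals):
    """Map each row label to its percentage in every goal area."""
    return {
        row: {area: percent(n, totals[area]) for area, n in counts.items()}
        for row, counts in table.items()
    }


percentages = column_percentages(ITEM_SELECTION_SOURCES, TOTALS)
for row, by_area in percentages.items():
    print(row, by_area)

# Each column of counts should add up to the number of tests in that area.
for area, total in TOTALS.items():
    column_sum = sum(row[area] for row in ITEM_SELECTION_SOURCES.values())
    print(area, column_sum == total)
```

The same check (counts summing to 352 or 373) can be applied to the other tables as a guard against transcription errors.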


The evaluation system covered several areas in construct validity. The latter
term has special salience, of course, in personality testing where a developer of a
new measure might justify such a test by empirically demonstrating its relationship
with some hypothetical construct. Nevertheless, the term has meaning in achievement
testing, insofar as such measures gain in usability by demonstrating their
independence from other measures and their "purity" of content (in the factor analytic
sense).


Table 2 shows that only about 1% to 3% of tests gave any information on divergent
validity (low correlations with other measures) or reported evidence of using factor
analysis in developing the test. Further, few tests (again, only about 1%) were
reported as having been used in an experiment or an evaluation. A fairly large
proportion of tests, however (64% of the Reading Comprehension and 50% of the
Vocabulary), did give a statement justifying the test's existence. All that was
required was a simple comment that showed that the developers had some specified
educational, psychological or learning theory in mind when they developed the
instrument.


Table 2


Numbers and Percentages of Tests Rated for
Aspects of Construct Validity


                                   Reading Comprehension       Vocabulary
                                        N        %             N        %

Divergent Validity         Yes         11       3%             4       1%
Information Given          No         341      97%           369      99%

Factorial Validity         Yes          2       1%             2       1%
Information Given          No         350      99%           371      99%

Experimental Use of        Yes          2       1%             3       1%
Test Reported              No         350      99%           370      99%

Theoretical Support        Yes        224      64%           188      50%
For Test Given             No         128      36%           185      50%


A very important consideration for any test relates to its concurrent and
predictive validity. How well does a test relate to established measures or relate
to future outcomes? Table 3 displays how language tests were rated on these criteria.

About 15% of both Reading Comprehension and Vocabulary tests reported concurrent
correlations greater than .70. The rest were below .70 for such correlations or
reported no validity studies in this area. The situation was worse in rated
predictive validity. Approximately 90% of the tests did not report such studies or
reported data that were not acceptable. Regarding the latter point, in both
concurrent and predictive validity, test evaluators judged the quality of the criterion.
If the criterion--a test or a measure of success at something--was patently
irrelevant or unrelated to the goal area of the evaluated test, the test was not credited.

Table 3


Numbers and Percentages of Tests Rated for
Concurrent Validity and Predictive Validity


                                        Reading Comprehension     Vocabulary
                                             N        %           N        %

Concurrent Validity
  Studies referred to, r > .70              53      15%          55      15%
  Studies referred to, .30 ≤ r ≤ .70        27       8%          12       3%
  No studies referred to                   272      77%         306      82%

Predictive Validity
  r > .70, relevant criteria,
  interval of > 1 month,
  cross-validation shrinkage < 10%           0       0%           0       0%
  r > .70, relevant criteria,
  interval of > 1 month                      7       2%           6       2%
  .30 ≤ r ≤ .70 or questionable
  criteria                                  29       8%          22       6%
  No study performed or
  irrelevant study                         316      90%         345      92%
A critical element of quality in any standardized test concerns its reliability.
How consistent are scores obtained by students? Table 4 shows ratings in three
types of test reliability. The pattern of results was remarkably similar for both
Reading Comprehension and Vocabulary. Tests were strongest in internal consistency.
About 20% of rated tests had coefficients above .70. With alternate form reliability
about 10% were above .70, while only 5% of test-retest correlations exceeded this
benchmark figure.

For test-retest reliability, a test was credited only if the time span between
testings was one month or more. Retesting with the same form or delayed alternate
form testing were both acceptable. Regarding the criterion of internal consistency,
split-half, Kuder-Richardson, or alpha coefficients were all accepted as evidence.
For alternate form reliability, either immediate or delayed testing was credited.


Table 4


Numbers and Percentages of Tests Rated
for Common Types of Reliability


                                   Reading Comprehension       Vocabulary
                                        N        %             N        %

Test-Retest Coefficient
  r > .90                               2       1%             8       2%
  .80 ≤ r < .90                         4       1%             8       2%
  .70 ≤ r < .80                         6       2%             3       1%
  r < .70                             340      96%           354      95%

Internal Consistency Coefficient
  r > .90                              37      11%            50      13%
  .80 ≤ r < .90               [illegible]       6%            25       7%
  .70 ≤ r < .80               [illegible]       1%             3       1%
  r < .70                             290      82%           295      79%

Alternate Form Coefficient
  r > .90                              10       3%            19       5%
  .80 ≤ r < .90                        20       6%            21       6%
  .70 ≤ r < .80                        12       3%             3       1%
  r < .70                             310      88%           330      88%
Consideration was given during test evaluation procedures to various factors
of a test's administrative usability. Validity and reliability are important, but
do not tell the whole story. Several criteria related to test interpretation are
listed in Table 5.

A larger percentage of Vocabulary tests than Reading Comprehension tests (19% vs.
8%) had a wide norm range. This norm range criterion was applied to determine
if tests were restricted in range. The latter occurred if the upper and lower limits
of the norm group were less than two years beyond the levels for which the test was
evaluated. For example, a test evaluated for grades 9-10 having no 8th or 12th
graders in the norm group was judged restricted in range.

Again Vocabulary tests showed superiority in score interpretation--75% had
common converted scores, as compared to 57% of Reading Comprehension tests. A
surprisingly large percentage of both types of tests had novel scores, ambiguous
scores, or no converted scores at all.

For the remaining three criteria in score interpretation, results were similar
for tests in the two goal areas. Both types had tests with relatively
straightforward procedures for conversion from raw score to converted score, both had
tests whose norm groups were not nationally representative, and both had a majority of
instruments capable of interpretation by school staff members.
Table 5


Numbers and Percentages of Tests Rated on
Criteria Related to Test Interpretation


                                   Reading Comprehension       Vocabulary
                                        N        %             N        %

Norm Range
  At least 2 years beyond              27       8%            71      19%
  Restricted range                    325      92%           302      81%

Score Interpretation
  Common and simple
  converted scores (a)                202      57%           279      75%
  Novel, ambiguous, or
  no converted scores                 150      43%            94      25%

Score Conversion
  Simple or no conversion             247      70%           267      72%
  Poor tables or 2 step
  conversion                          102      29%            99      27%
  Complicated conversion                3       1%             7       1%


(a) Common and simple converted scores were: pass/fail, percentile ranks, mental
ages, deviation IQ's, and grade equivalents.

(b) Nationally representative meant having at least four of the following attributes:
(3) all areas of U.S. sampled; (4) appropriate age range represented and exhausted;
(5) racial/ethnic representation or separate norms for such groups; (6) urban,
suburban, and rural sampling.


To elaborate on several areas related to norm samples and quality of scores,
three criteria dealt with these topics in depth. Table 6 gives percentages
relevant to such concerns. It was found that 70% of Vocabulary tests, but only 57% of
Reading Comprehension tests, had replicability of standardization procedures. This
meant that procedures of administration, scoring and interpretation were sufficiently
standardized so that results could be duplicated from the norm group. About half
of the tests gave no information on score distributions or reported badly skewed
distributions. Thirty percent or more of tests in both areas had well drawn out
score distributions.

About half of the tests considered had some type of fairly well graduated
converted scale. But a dismayingly large number had crude graduation or a type of
novel scale that most test users would not be familiar with.
Table 6


Numbers and Percentages of Tests Rated on
Replicability of Standardization Procedures, Range of
Coverage and Quality of Score Graduation


                                   Reading Comprehension       Vocabulary
                                        N        %             N        %

Can the testing procedure be
duplicated? Are procedures
of administration, scoring,
and interpretation standardized?
  Yes                                 202      57%           262      70%

Does the test have an adequate
range of coverage? (high ceiling,
low floor, symmetrical distribution)
  Tails of distribution
  drawn out, floor or
  ceiling not reached                 105      30%           142      38%
  One tail of distribution
  drawn out, floor or
  ceiling not reached                  21       6%            14       4%
  Floor or ceiling reached             29       8%            23       6%
  No information on score
  distribution or badly
  skewed                              197      56%           194      52%

Quality of Score Graduation
  Percentiles, grade equivalents,
  or mental ages                      132      38%           156      42%
  Deciles, stanines, T-scores,
  or z-scores                          26       7%            60      16%
  Pass-fail, quartiles, or
  novel scales                        194      55%           157      42%


The last criterion of test quality focused on how well the test helped a
user make a decision about the test taker. Tests were rated high if they gave
prescriptive information on a student (e.g. information associating a score with
some educational placement decision). Table 7 shows that 90% of tests had, at
best, poor guidelines for decisions. They had, for the most part, little information
that would allow a score to be translated into some action.


Table 7


Numbers and Percentages of Tests Rated on
Decision-Making Utility


                                   Reading Comprehension       Vocabulary
                                        N        %             N        %

Does the test provide
information useful for
making any individual
or group decisions?
  Definite, prescriptive
  decisions                             0       0%             1       0%
  Suggestive decisions                 30       9%            34       9%
  Poor guidelines for
  decisions                           115      33%           104      28%
  Little or no information
  for decisions                       207      59%           234      63%

Discussion

Differences between Reading Comprehension and Vocabulary

It might be well to first review the findings with respect to differences
between tests in Reading Comprehension and Vocabulary. Although the overall pattern
of percentages on the various criteria was similar for the two goal areas,
in some cases substantial differences were found.

On the very first criterion considered, Sources for Item Selection,
a markedly larger proportion of tests in Reading Comprehension than in
Vocabulary (18 versus 6 percent) had a detailed description of item selection
procedures. The latter meant that the publisher provided a statement on where
items came from or what resources (e.g. curriculum guides) were used in
arriving at an initial pool of questions.

The differences between the two goal areas may have reflected the
necessity, in the case of Reading Comprehension, to clearly describe the
type of source material and alert the test purchaser to the specific content
areas from which reading passages would come. Such a justification was
perceived as, perhaps, less necessary in Vocabulary. In selecting vocabulary
items, some publishers may have simply used item tryout information to
eliminate extremely difficult and easy words from some arbitrary starting
list.


In another point of contrast, 64% of the Reading tests and only 50% of
the Vocabulary tests reported theoretical support (see Table 2). This
meant that more of the Reading tests gave some sort of statement of rationale--
some defense for the test's existence. The reasons may well have been the same
as for the relative superiority of Reading tests on the first criterion.
It may have been simply that authors of Vocabulary tests felt less need for
any justification--under the assumption that the utility of knowing something
about a student's knowledge of vocabulary is obvious.


Two criteria on which Vocabulary tests made a better showing than Reading
were Norm Range and Score Interpretation (see Table 5). Nineteen percent
of Vocabulary as against eight percent of Reading tests had a norm range of
at least 2 years. The latter meant that norm groups had upper and lower limits
at least 2 years beyond the levels for which the test was evaluated (i.e.
levels 9-10 or 11-12). Moreover, a fairly large discrepancy existed between
the two goal areas on the dimension Score Interpretation. For Vocabulary,
fully 75% versus 57% of Reading tests had common or simple converted scores
(e.g. percentiles). In other words, a surprisingly large proportion, 43%,
of the Reading tests had novel, ambiguous or no converted scores.

Vocabulary tests fared relatively worse on the criterion Score
Interpreter. Twelve percent of them required a specialist to interpret, but
only 2% of the Reading tests required a specially trained score interpreter.

These findings, the relative superiority of Vocabulary tests in Norm
Range and Score Interpretation and their relative inferiority on the Score
Interpreter criterion, may be related to the varied uses of the two types
of tests. At least some of the Vocabulary tests came from batteries where
they played the role of an [illegible] measure. Given the association of
vocabulary knowledge with IQ (in terms of often reported overlapping variance)
it may very well be that Vocabulary tests shared some aspects with IQ measures
that Reading tests did not. This would explain the superiority of tests
in Vocabulary in Norm Range (the use of a wide age span in norms) and Score
Interpretation (the use of common converted scores) and the greater likelihood
of a Vocabulary test requiring a specialist score interpreter.

The last criterion on which Reading Comprehension and Vocabulary differed
was Replicability of Standardization Procedures. Vocabulary had
an advantage here. Seventy percent of its tests had replicable procedures--
only 57% of Reading tests were so judged. This area related to whether the
test provided uniformity of procedures for administering and scoring and
whether the test user could use the test with samples similar to the
standardization group. In other words, were the circumstances in standardizing the test
similar to those faced by a test user in testing a typical group of students?


The general results and their implications


The results, despite some points of contrast, were fairly consistent
for the two educational areas across the various criteria. These results have
some implications for researchers and other test users.


1. The finding that few tests gave information on item selection
   reinforces the importance of the researcher carefully looking over
   the items themselves. One cannot expect much guidance from
   publishers on sources for items and, furthermore, any general
   statements about such sources may not be useful for many users. A
   content analysis is de rigueur. Unfortunately, 85% of Reading
   and Vocabulary tests reported low correlations for concurrent
   validity or reported nothing at all. The relationship between
   many little known Reading and Vocabulary tests and established
   measures is unclear.

2. A few tests had, for a well specified age range, very high
   reliabilities (i.e. above .90), but the great majority did not.
   The researcher is fortunate if a test with high reliability also
   has enough other requisite characteristics that it can be used in
   a given research circumstance. If a test with less-than-optimal
   reliability must be used, the following considerations might be kept
   in mind.

   a) The researcher may often have to estimate reliabilities for
      a given (narrow) age range. Too often, publishers perform
      reliability studies with samples having wide age ranges.

   b) Some thought might be given to performing small scale
      reliability studies, especially for special student populations.

   c) Many tests had reliabilities below .70. Researchers using
      tests for evaluation purposes should be sensitive to problems
      of internal validity bias due to instrumentation error
      (Campbell and Stanley, 1964). A test with .70 reliability is
      one in which only about 50% of the variance is shared for the two
      scores (i.e. in alternate form and test-retest situations).
      Serious thought should be given to some measurement strategy
      that optimizes inferences about a program or treatment under
      study. Using more than one measure--the method of "converging
      operations" (Webb, Campbell, Schwartz, & Sechrest, 1966)--is
      one such strategy.

3. Few tests gave clues as to what decisions could be made about individuals
   or groups based on test scores. This points up the necessity for
   thinking out in advance exactly how scores will be used. Reading
   and Vocabulary tests are useful for program evaluation purposes.
   For example, standardized secondary level tests in reading and
   other skills are being used to evaluate success of the Emergency
   School Aid Act (ESAA) program. But their utility for other
   purposes might at times be questioned. This is true especially
   given the fact that few predictive validity studies were identified
   for the tests examined in this study. There were little data
   on how tests related to such things as job performance or grade
   point average for the first year of college.
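The remark in point 2(c) that a coefficient of .70 corresponds to only about 50% shared variance rests on squaring the correlation between the two sets of scores. The following Python sketch shows that arithmetic; the function name `shared_variance` is an illustrative choice, and the example treats the reported coefficient as a plain correlation between two administrations, which is how the passage above describes it.

```python
def shared_variance(r):
    """Proportion of variance two score sets share, given their correlation r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * r


# Benchmark coefficients used in the ratings discussed above.
for r in (0.70, 0.80, 0.90):
    print(f"r = {r:.2f} -> shared variance of about {shared_variance(r):.0%}")
```

For r = .70 the squared value is .49, the "about 50%" cited in the text; a coefficient of .90 is needed before roughly four fifths of the variance is shared.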




A Concluding Note 


There were several limitations to this study. One concerns the procedure
of acquiring and categorizing tests. Virtually every test on the market was
obtained and evaluated. This meant that some rather obscure instruments
were given ratings along with very well known tests. There is some
justification for this, however. Tests are on the market because enough people
buy them to allow a profit for the publisher. In the absence of information
on how many tests of a given type are sold, it is a fair
assumption that hundreds (more realistically, thousands) of copies are sold
every year of even little known instruments. It is a defensible proposition
that these tests should be evaluated.

Another point to be made on this study's test evaluations is an issue
related to the evaluation criteria themselves. The criteria were general
and were applied to tests in every subject domain (the report by
Hoepfner et al., 1974, lists 298 goal areas into which tests were categorized).
The test purchaser and researcher should be aware of special criteria aimed
at Reading Comprehension and Vocabulary tests exclusively. Such criteria
were not included in the present report or the source data from which it was
derived. But researchers should be cognizant of special problems with
language oriented tests. Probably the most significant of these is the
passage dependence of Reading Comprehension tests. Tuinman (1973-1974)
found that some items in Reading Comprehension tests are not dependent on
the passages of prose that they follow. Such items are answered correctly
at a higher than chance rate by subjects who do not read the passage with
which the items are ostensibly linked. Needless to say, this weakness in
measurement ought to be noted by a prospective test user.


Petrosko, J.M. & Hufano, L. An assessment of the quality of high school
mathematics tests. Paper presented at the annual meeting of the